An Introduction in R
Open to everyone affiliated with IFAS
Provides free assistance with:
Experimental design, including sample size & power calculations
Data analysis, providing conceptual or applied guidance
Interpretation and reporting of results, including preparation of tables and figures
Submit a consulting request at: https://bit.ly/3Goicy9
Teams Page: useful articles, code snippets, and a place to ask general stats questions.
Seminar Series: Basic, software agnostic introductions to methods, concepts and best practices in statistics. First Friday (usually) of every month @ 3 PM in 426 McCarty C and on Zoom.
Model Trends Using Regression
Researchers in agronomy frequently perform dose- or rate-response studies. Regression models, rather than ANOVA, are generally the most appropriate way to analyse these data.
Best Model = Good Fit to Data + Useful to Researcher (+ Justified by Theory)
Non-linear regression models can provide estimates of a parsimonious set of easily interpreted and biologically meaningful parameters, in a way that linear and polynomial regression sometimes cannot.
Don’t Break Up Your Analysis, Build Out Your Model
In general, it is better to analyse the data from all treatments/years/locations together, rather than separately. This slightly complicates model specification but (1) maximizes the precision of your estimates and (2) greatly facilitates (and improves the statistical power of) subsequent testing.
Choose Versatile Software Tools
It is better to become familiar with general-purpose non-linear modeling packages (e.g., nlme) than to rely on more user-friendly but much less flexible, domain-specific R packages.
Breathe
Fitting non-linear models requires optimization algorithms which are not guaranteed to find a solution, and the flexibility of non-linear modeling increases the scope for model misspecification and user error. Non-linear modeling thus requires more patience than linear modeling, especially at the beginning.
Observed Value = Deterministic Trend + Random Variation
\[ Y = f(x) + \epsilon \]
\[ \epsilon \sim \mathrm{Normal}(0, \sigma^2) \]
Henceforth, we are only concerned with \(f(x)\)
Residuals assumed to be normally distributed, independent, with uniform variance
Tools do exist to relax these assumptions/extend this model, but are beyond our scope
Equation:
\[ f(x) = \beta_0 + \beta_1x \]
\(\beta_0\): intercept (i.e., value of y when x = 0)
\(\beta_1\): slope (i.e. change in y per unit change in x)
Equation:
\[ f(x) = \alpha_{0} + \frac{(x - \bar{x})}{\alpha_1} \]
\(\alpha_0\): intercept (i.e., value of y when x = mean(x))
\(\frac{1}{\beta_1} = \alpha_1\): rate (i.e. change in x which causes 1 unit change in y)
Equation:
\[ f(x) = \beta_{0i} + \beta_{1}x \]
\(\beta_{0i}\): intercept for the i-th group
\(\beta_{1}\): slope, shared by all groups
Design matrices (X) encode factor effects (\(\mathbf{\beta_1}\)) in columns of dummy variables
There are several, somewhat arbitrary, ways to construct them depending on the contrast coding being used
The choice of contrast coding thus determines interpretation of coefficients and reasonable starting values (essentially, reparameterizations of the linear model)
Matrix algebra notation for a linear model:
\(Y = \mathbf{X}\beta + \epsilon\)
For example, with 2 reps each of 2 treatments
\(\begin{bmatrix} Y_{1,1} \\ Y_{1,2} \\ Y_{2,1} \\ Y_{2,2} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 0 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_{1,1} \\ \epsilon_{1,2} \\ \epsilon_{2,1} \\ \epsilon_{2,2} \end{bmatrix}\)
“Treatment Contrasts” (R default)
Intercept: value for the first factor level (a control, perhaps?)
Other coefficients: difference between the intercept and that factor level’s value
grp (Intercept) trtB
A 1 0
A 1 0
A 1 0
B 1 1
B 1 1
B 1 1
“SAS Contrasts”
Intercept: value for the last factor level (a control, perhaps?)
Other coefficients: difference between the intercept and that factor level’s value
grp (Intercept) trtA
A 1 1
A 1 1
A 1 1
B 1 0
B 1 0
B 1 0
“Sum-to-Zero Contrasts”
Intercept: average value across all factor levels
Other coefficients: difference between the intercept and that factor level’s value
grp (Intercept) trt1
A 1 1
A 1 1
A 1 1
B 1 -1
B 1 -1
B 1 -1
“No Intercept”
grp trtA trtB
A 1 0
A 1 0
A 1 0
B 0 1
B 0 1
B 0 1
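The four design matrices above can be generated in base R with model.matrix and the corresponding contrast settings. A minimal sketch, using a hypothetical two-level factor with three reps per level:

```r
# Two-level factor, three reps per level
grp <- factor(rep(c('A', 'B'), each = 3))

# Treatment contrasts (the R default): first level is the reference
X_trt  <- model.matrix(~ grp)
# "SAS" contrasts: last level is the reference
X_sas  <- model.matrix(~ grp, contrasts.arg = list(grp = 'contr.SAS'))
# Sum-to-zero contrasts: intercept is the average across levels
X_sum  <- model.matrix(~ grp, contrasts.arg = list(grp = 'contr.sum'))
# No intercept / cell-means coding: one column per level
X_cell <- model.matrix(~ 0 + grp)
```

Printing any of these matrices reproduces the corresponding dummy-variable columns shown above.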
Equation:
\[ f(x) = \beta_{0i} + \beta_{1i}x \]
\(\beta_{0i}\): intercept for the i-th group
\(\beta_{1i}\): slope for the i-th group
Equation:
\[ f(x) = \beta_{0} + \beta_{1}x + \beta_2x^2 \]
\(\beta_0\): intercept
\(\beta_{1}\): slope at x = 0 (not easily interpreted on its own)
\(\beta_2\): curvature ("deceleration" of the trend when negative)
Equation:
\[ f(x) = \alpha_1 + \kappa(x - \alpha_0)^2 \]
\(\alpha_0\): optimum x value (i.e. x value which maximizes y)
\(\alpha_{1}\): maximum y value
\(\kappa\): related to “steepness” of the curve
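The two quadratic parameterizations describe the same curve: expanding the vertex form gives \(\beta_0 = \alpha_1 + \kappa\alpha_0^2\), \(\beta_1 = -2\kappa\alpha_0\), and \(\beta_2 = \kappa\). A quick numerical check (parameter values are arbitrary illustrations):

```r
# Vertex-form parameters (arbitrary illustrative values)
a0 <- 5; a1 <- 10; k <- -0.2
# Implied polynomial-form coefficients
b0 <- a1 + k * a0^2
b1 <- -2 * k * a0
b2 <- k
x <- 0:10
all.equal(a1 + k * (x - a0)^2,      # vertex form
          b0 + b1 * x + b2 * x^2)   # polynomial form; TRUE
```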
Technical Definition: functions for which the first derivative with respect to at least one parameter depends on one or more of the parameters. The takeaway is that non-linearity is not actually about "curviness", and some functions can be linear or non-linear depending on the parameterization.
Plain(-ish) Language Summary: Functions which exhibit some combination of limiting behavior, such as asymptotically approaching an upper or lower limit, critical points, such as a maximum or minimum, and inflection points, where a trend switches from accelerating to decelerating.
Target rate, \(T_p\), is the x value that achieves some proportion, P, of the asymptote:
\[ T_p = \frac{\beta P}{1-P} \]
or, if prices, \(C_x\) and \(C_y\), are known for x and y, respectively, then you can calculate net returns (\(I_{net}\)):
\[ I_{net} = \frac{C_y\alpha x}{\beta + x} - xC_x \]
\(\alpha\): asymptote (y value as \(x \rightarrow \infty\))
\(\beta\): half-maximum (x value where y = \(\frac{\alpha}{2}\))
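For example, with hypothetical parameter values \(\alpha = 7\) and \(\beta = 40\), the rate achieving 90% of the asymptote works out as:

```r
# Hypothetical Michaelis-Menten parameters
alpha <- 7; beta <- 40; P <- 0.9
# Target rate achieving 90% of the asymptote
Tp <- beta * P / (1 - P)            # 360
# Check: response at Tp, as a fraction of the asymptote
(alpha * Tp / (beta + Tp)) / alpha  # 0.9
```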
First, run the following (after installing the required packages, as necessary) to set up our R environment:
# Load packages
library(agridat)
library(tidyverse)
library(nlme)
library(ggResidpanel)
library(emmeans)
library(marginaleffects)
library(car)
# set ggplot aesthetics
my_pal = c('steelblue', 'sienna', 'olivedrab', 'goldenrod')
my_theme = theme_classic()+
theme(text = element_text(size = 12),
legend.position = 'bottom')
options(ggplot2.discrete.colour = my_pal,
ggplot2.discrete.fill = my_pal)
theme_set(my_theme)
This data is adapted from Schabenberger and Pierce (2001) and originally reported in Mead (1970). It contains the yield (\(\mathrm{kg/m^2}\)) of three onion (Allium cepa L.) cultivars when planted at various densities (\(\mathrm{plants/m^2}\)). Schabenberger and Pierce (2001) fit an alternative parameterization of the Michaelis-Menten function and referred to it by a different name.
# Load and prep data
onion = data.frame(
Cultivar = factor(rep(LETTERS[1:3], each = 10)),
Density = c(33.045, 35.629, 64.261, 75.24, 93.323,
144.129, 192.243, 232.178, 309.678, 334.542,
23.035, 28.524, 40.903, 56.403, 84.281,
93.861, 108.823, 173.084, 228.41, 276.74,
26.694, 37.997, 47.899, 67.059, 88.587,
103.226, 181.587, 201.177, 277.063, 326.469),
Yield = c(3.49, 3.185, 4.562, 4.537, 4.442,
5.434, 5.825, 5.619, 6.441, 6.189,
3.031, 3.112, 3.833, 4.072, 4.475,
4.665, 4.114, 5.764, 5.596, 5.064,
3.118, 3.48, 3.482, 3.541, 4.323,
4.036, 5.502, 4.868, 5.541, 5.321))
str(onion, width = 60, strict.width = 'cut')
'data.frame': 30 obs. of 3 variables:
$ Cultivar: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1..
$ Density : num 33 35.6 64.3 75.2 93.3 ...
$ Yield : num 3.49 3.18 4.56 4.54 4.44 ...
# Michaelis-Menten function
micmen = function(x, ymax, xhalf){
ymax*x/(xhalf + x)
}
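A quick sanity check of this function (repeating the definition so the snippet stands alone): at x = xhalf, the response should be exactly half of ymax.

```r
# Michaelis-Menten function (as defined above)
micmen <- function(x, ymax, xhalf){
  ymax*x/(xhalf + x)
}
# At the half-maximum density, yield is ymax/2
micmen(40, ymax = 7, xhalf = 40)  # 3.5
```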
# Considering just cultivar A
onion |>
filter(Cultivar == 'A') |>
ggplot(aes(x = Density, y = Yield, color = Cultivar))+
# Plot the data itself
geom_point(size = 3) +
# Plot linear trend
geom_smooth(method = 'lm', se = F, size = .5) +
# Plot Michaelis-Menten trend, using this
# to identify reasonable starting values
geom_function(fun = micmen,
args = list(ymax = 7,
xhalf = 40),
                size = 0.5, linetype = 2)
First, fit the linear model (this could also be done using lm).
# Fit the linear (ANCOVA) regression model
onion_mod_lin = gnls(model = Yield ~ yinit + slope*Density,
data = onion,
params = yinit + slope ~ 0 + Cultivar,
start = c(yinit = c(3.5, 3.5, 3.5),
slope = c(0.01, 0.01, 0.01)))
print(onion_mod_lin)
Generalized nonlinear least squares fit
Model: Yield ~ yinit + slope * Density
Data: onion
Log-likelihood: -16.46228
Coefficients:
yinit.CultivarA yinit.CultivarB yinit.CultivarC slope.CultivarA slope.CultivarB
3.579669886 3.375371187 3.229866885 0.009197382 0.008951270
slope.CultivarC
0.008037759
Degrees of freedom: 30 total; 24 residual
Residual standard error: 0.4683104
Next, fit the Michaelis-Menten model:
# Fit the Michaelis-Menten model
onion_mod_mm = gnls(model = Yield ~ ymax*Density/(xhalf + Density),
data = onion,
params = ymax + xhalf ~ 0 + Cultivar,
                     start = c(xhalf = c(40, 40, 40),
ymax = c(7, 7, 7)))
print(onion_mod_mm)
Generalized nonlinear least squares fit
Model: Yield ~ ymax * Density/(xhalf + Density)
Data: onion
Log-likelihood: -6.436848
Coefficients:
ymax.CultivarA ymax.CultivarB ymax.CultivarC xhalf.CultivarA xhalf.CultivarB
6.863395 5.849601 5.847467 37.721908 23.875150
xhalf.CultivarC
30.484627
Degrees of freedom: 30 total; 24 residual
Residual standard error: 0.3352747
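The two fits can be compared, and the residuals inspected, with something along these lines (a sketch, assuming the fitted objects above; plot and qqnorm have methods for gnls objects via nlme):

```r
# Compare the linear and Michaelis-Menten fits by AIC
AIC(onion_mod_lin, onion_mod_mm)
# Residual diagnostics for the selected model
plot(onion_mod_mm)    # standardized residuals vs fitted values
qqnorm(onion_mod_mm)  # normal Q-Q plot of the residuals
```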
\(\Delta\)AIC strongly favors the non-linear model. The diagnostic plots of the residuals indicate possible minor violations of the assumptions of normality and homoscedasticity, which in other circumstances I would probe further, but that is beyond the scope of this workshop.
All the tools from the emmeans package will work with models fit with gnls; you simply need to specify the parameter of interest via the argument param:
model term df1 df2 F.ratio p.value
Cultivar 2 23 4.316 0.0256
Cultivar emmean SE df lower.CL upper.CL
A 6.86 0.287 23 6.27 7.46
B 5.85 0.272 23 5.29 6.41
C 5.85 0.274 23 5.28 6.41
Confidence level used: 0.95
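Output like the above can be produced with calls along these lines (the exact calls are assumed, not shown in the original; param = 'ymax' selects the asymptote parameter):

```r
# F-test for cultivar differences in the asymptote
joint_tests(onion_mod_mm, param = 'ymax')
# Estimated marginal means of ymax, by cultivar
emmeans(onion_mod_mm, ~ Cultivar, param = 'ymax')
```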
For derived quantities such as \(T_{90\%}\), we can use the delta method to generate estimates, standard errors, and confidence intervals. First, check the coefficient names in the fitted model:
[1] "ymax.CultivarA" "ymax.CultivarB"
[3] "ymax.CultivarC" "xhalf.CultivarA"
[5] "xhalf.CultivarB" "xhalf.CultivarC"
# Calculate 90% target rate for each cultivar, and
# output the results
t90a = deltaMethod(object = onion_mod_mm,
g. = 'xhalf.CultivarA*0.9/(1 - 0.9)',
func = 'T90% for Cultivar A:')
t90b = deltaMethod(object = onion_mod_mm,
g. = 'xhalf.CultivarB*0.9/(1 - 0.9)',
func = 'T90% for Cultivar B:')
t90c = deltaMethod(object = onion_mod_mm,
g. = 'xhalf.CultivarC*0.9/(1 - 0.9)',
func = 'T90% for Cultivar C:')
rbind(t90a, t90b, t90c)
                     Estimate     SE   2.5 % 97.5 %
T90% for Cultivar A: 339.497 56.198 229.352 449.64
T90% for Cultivar B: 214.876 42.219 132.129 297.62
T90% for Cultivar C: 274.362 52.767 170.940 377.78
The marginaleffects package has additional tools for generating estimates and confidence bands for the fitted trendline:
# Generate a new data set containing all combinations of
# cultivar levels and a fine grid of density values
toplot = expand.grid(Cultivar = factor(LETTERS[1:3]),
Density = 0:33*10)
# Generate predictions for these new cultivar and density
# combinations
onion_preds = avg_predictions(model = onion_mod_mm,
newdata = toplot,
by = c('Cultivar', 'Density'))
# Plot the results, along with the original data
ggplot(onion_preds, aes(x = Density, color = Cultivar, fill = Cultivar)) +
facet_grid(cols = vars(Cultivar))+
geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
color = NA, alpha = 0.2) +
geom_line(aes(y = estimate))+
  geom_point(aes(y = Yield), data = onion, size = 2)
Equation:
\[ f(x) = \alpha_0 + \frac{\alpha_1 - \alpha_0}{1+e^{-\beta(\mathrm{ln}(x)-\mathrm{ln}(EC_{50}))}} \]
\(\alpha_0\): lower asymptote (here, survival as dose \(\rightarrow \infty\))
\(\alpha_1\): upper asymptote (here, survival as dose \(\rightarrow 0\))
\(\beta\): "slope" (steepness of the transition around \(EC_{50}\))
\(EC_{50}\): half-maximal effective concentration (dose at which the response is midway between the asymptotes)
The “pigweed” experiment - originally from Ferguson, Hamill, and Tardif (2001) and republished by Bowley (2015) - examined the response (as percent survival) exhibited by two biotypes of Amaranthus powellii S. Watson when exposed to increasing doses of the herbicide imazethapyr.
pigweed = data.frame(
Biotype = factor(rep(LETTERS[1:2], each = 36)),
Rate = c(rep(c(1, 4, 8, 16, 32,
64, 128, 256, 512),
each = 4),
rep(c(0.001, 0.063, 0.125, 0.25,
0.5, 1, 2, 4, 16),
each = 4)),
Block = factor(rep(c(1:4), 18)),
Pct_Survival = c(100, 100, 100, 100, 100, NA,
89.44, 61.53, 97.7, 100, 81.74,
61.04, 57.47, 100, 82.42, 73.71,
79.31, NA, 52.82, 69.51, 66.67,
65.44, 55.38, 51.56, 83.91,
76.5, 43.85, 42.18, 68.97, NA,
22.44, 34.62, 51.72, 60.14,
14.54, 15.85, NA, 100, 100, 100,
87.5, 74.3, 97.37, 42.87, 38.28,
50.37, 31.21, 27.17, 42.19, 38.68,
44.3, 45.87, 35.16, 19.65, 28.4,
27.35, 12.5, 14.31, 17.57, 39.44,
7.81, 32.88, 17.63, 22.87, 8.59,
21.87, 19.36, 24.06, 11.72, 3.84,
12.95, 19.89))
str(pigweed, width = 60, strict.width = 'cut')
'data.frame': 72 obs. of 4 variables:
$ Biotype : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1..
$ Rate : num 1 1 1 1 4 4 4 4 8 8 ...
$ Block : Factor w/ 4 levels "1","2","3","4": 1 2 3..
$ Pct_Survival: num 100 100 100 100 100 ...
# Define Log-Logistic function
loglogistic = function(x, ymax, ymin, slope, IC50){
  ymin + (ymax - ymin)/(1 + exp(-slope*(log(x) - log(IC50))))
}
ggplot(pigweed, aes(x = (Rate), y = Pct_Survival, color = Biotype))+
# Plot the data itself
facet_grid(cols = vars(Biotype), scales = 'free')+
geom_point() +
# Plot the log-logistic function
geom_function(fun = loglogistic,
args = list(ymin = 30, ymax = 100,
                          IC50 = 200, slope = -50))
Fit the log-logistic model:
pigweed_mod_ll = gnls(Pct_Survival ~ ymin + (ymax - ymin)/(1+ exp(slope*(log(Rate) - log(EC50)))),
data = pigweed,
params = list(ymin + ymax ~ 1,
slope + EC50 ~ Biotype - 1),
start = list(ymin = 0, ymax = 100,
slope = c(.5, .5),
EC50 = c(1, 0.1)),
na.action = na.omit)
print(pigweed_mod_ll)
Generalized nonlinear least squares fit
Model: Pct_Survival ~ ymin + (ymax - ymin)/(1 + exp(slope * (log(Rate) - log(EC50))))
Data: pigweed
Log-likelihood: -273.9753
Coefficients:
ymin ymax slope.BiotypeA slope.BiotypeB EC50.BiotypeA
16.38695930 102.71638822 0.57501493 1.19270649 76.73484213
EC50.BiotypeB
0.08745889
Degrees of freedom: 68 total; 62 residual
Residual standard error: 14.24344
Fit the hyperbolic model:
pigweed_mod_hy = gnls(Pct_Survival ~ ymin + slope*(y0 - ymin)/(slope+Rate),
data = pigweed,
params = list(ymin + y0 + slope ~ 0+Biotype),
start = list(ymin = c(20, 7),
y0 = c(100, 100),
slope = c(20, 0.25)),
na.action = na.omit)
print(pigweed_mod_hy)
Generalized nonlinear least squares fit
Model: Pct_Survival ~ ymin + slope * (y0 - ymin)/(slope + Rate)
Data: pigweed
Log-likelihood: -274.9822
Coefficients:
ymin.BiotypeA ymin.BiotypeB y0.BiotypeA y0.BiotypeB slope.BiotypeA
34.21815082 14.56277604 94.52104250 102.68822396 50.30093630
slope.BiotypeB
0.08757621
Degrees of freedom: 68 total; 62 residual
Residual standard error: 14.45592
Fit the negative exponential model:
pigweed_mod_ne = gnls(Pct_Survival ~ ymin + (yinit - ymin)*exp(-Rate/slope),
data = pigweed,
params = list(yinit + ymin + slope ~ 0+Biotype),
start = list(ymin = c(35, 15),
yinit = c(100, 100),
slope = c(70, 0.25)),
na.action = na.omit)
print(pigweed_mod_ne)
Generalized nonlinear least squares fit
Model: Pct_Survival ~ ymin + (yinit - ymin) * exp(-Rate/slope)
Data: pigweed
Log-likelihood: -277.1073
Coefficients:
yinit.BiotypeA yinit.BiotypeB ymin.BiotypeA ymin.BiotypeB slope.BiotypeA
90.2168592 100.2895962 38.8751801 20.0059165 89.7827055
slope.BiotypeB
0.1316581
Degrees of freedom: 68 total; 62 residual
Residual standard error: 14.91482
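The three candidate fits can be compared directly (a sketch, assuming the fitted objects above):

```r
# Compare the log-logistic, hyperbolic, and negative
# exponential fits by AIC
AIC(pigweed_mod_ll, pigweed_mod_hy, pigweed_mod_ne)
```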
AIC suggests that the log-logistic is the best fitting model, while a visual check of the residuals does not suggest violations of the assumptions of normality and homoscedasticity.
All the tools from the emmeans package will work with models fit with gnls; you simply need to specify the parameter of interest via the argument param:
model term df1 df2 F.ratio p.value
Biotype 1 61 4.702 0.0340
Biotype emmean SE df lower.CL upper.CL
A 76.7348 35.3590 61 6.0302 147.439
B 0.0875 0.0224 61 0.0427 0.132
Confidence level used: 0.95
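As in the onion example, output like the above comes from emmeans calls with the param argument (calls assumed; the estimates shown are on the EC50 scale, so param = 'EC50' is the likely choice):

```r
# F-test for biotype differences in EC50
joint_tests(pigweed_mod_ll, param = 'EC50')
# Estimated marginal means of EC50, by biotype
emmeans(pigweed_mod_ll, ~ Biotype, param = 'EC50')
```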
The marginaleffects package has additional tools for generating estimates and confidence bands for the fitted trendline:
# Generate a new data set containing all combinations of
# Biotype and fine grid of Rate values. Note to remove
# rows with biotype B and high rates or it messes up
# plots later
toplot = expand.grid(Biotype = factor(LETTERS[1:2]),
Rate = exp(seq(-6.9, 6.3, .1))) |>
filter(!(Biotype == 'B' & Rate > 17))
# Generate predictions for these new cultivar and density
# combinations
pigweed_preds = avg_predictions(model = pigweed_mod_ll,
newdata = toplot,
by = c('Biotype', 'Rate'))
# Plot the results, along with the original data
ggplot(pigweed_preds, aes(x = Rate, color = Biotype, fill = Biotype)) +
facet_grid(cols = vars(Biotype), scales = 'free')+
geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
color = NA, alpha = 0.2) +
geom_line(aes(y = estimate))+
  geom_point(aes(y = Pct_Survival), data = pigweed, size = 2)
nlme package. Some chapters are highly mathematical, others very much applied. Most examples are not agriculture related, though.